@arnaud-lb (Member) commented Sep 4, 2025

This is a PoC of making the cycle collector generational.

In some applications, there is an object that transitively references everything (e.g. the DI container or the UnitOfWork), and many objects transitively reference this object back. As a result, most GC runs in these applications will end up scanning the entire application. This defeats the premise of the cycle collector.

Based on that, a generational cycle collector should improve throughput to some degree.

Unlike a tracing GC, we don't need to keep track of created inter-generational references, so this turns out to be relatively simple.

  1. Nodes that are scanned but not collected during a run are marked as OLD
  2. There are two kinds of GC runs: partial and full. Partial runs ignore OLD nodes during trial deletion and other steps.
  3. gc_possible_root() adds OLD nodes to a separate buffer that doesn't count towards the threshold
  4. At the end of a partial run, OLD roots are moved to a separate buffer. This buffer is appended to the roots before full runs.
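The four steps above can be sketched with a toy model. This is not the PoC's code: every name here (`toy_node`, `toy_gc`, `toy_possible_root`, `toy_run`) is hypothetical, and the `garbage` field is only a stand-in for the outcome of trial deletion, which the sketch does not implement.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: names are hypothetical and do not match zend_gc.c. */
typedef struct toy_node {
    bool old;     /* survived an earlier run (step 1) */
    bool garbage; /* stand-in for "trial deletion found a dead cycle" */
    bool freed;
} toy_node;

#define TOY_BUF_MAX 64

typedef struct toy_buf {
    toy_node *items[TOY_BUF_MAX];
    size_t len;
} toy_buf;

typedef struct toy_gc {
    toy_buf roots;     /* young possible roots */
    toy_buf old_roots; /* OLD possible roots, only examined by full runs */
} toy_gc;

static void toy_buf_push(toy_buf *b, toy_node *n)
{
    assert(b->len < TOY_BUF_MAX);
    b->items[b->len++] = n;
}

/* Step 3: OLD nodes go to the separate buffer. */
void toy_possible_root(toy_gc *gc, toy_node *n)
{
    toy_buf_push(n->old ? &gc->old_roots : &gc->roots, n);
}

/* Steps 1, 2 and 4. Returns the number of nodes collected. */
size_t toy_run(toy_gc *gc, bool full)
{
    size_t collected = 0;

    if (full) {
        /* Full run: old roots rejoin the roots and lose their OLD flag. */
        for (size_t i = 0; i < gc->old_roots.len; i++) {
            gc->old_roots.items[i]->old = false;
            toy_buf_push(&gc->roots, gc->old_roots.items[i]);
        }
        gc->old_roots.len = 0;
    }

    for (size_t i = 0; i < gc->roots.len; i++) {
        toy_node *n = gc->roots.items[i];
        if (n->freed) {
            continue; /* duplicate entry for a node already collected */
        }
        if (n->old) {
            /* Partial run: park the OLD root for the next full run.
             * Full run: the node was already proven live in this run. */
            if (!full) {
                toy_buf_push(&gc->old_roots, n);
            }
            continue;
        }
        if (n->garbage) {
            n->freed = true;
            collected++;
        } else {
            n->old = true; /* step 1: scanned but not collected */
        }
    }
    gc->roots.len = 0;
    return collected;
}
```

In this model, a node that becomes garbage after being marked OLD is skipped by every partial run and is only reclaimed by the next full run, which is the trade-off the steps describe.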

This is enabled by setting the environment variable FULL_GC_FREQ to a non-zero value (e.g. 10 will run a full GC every 10 runs).
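As a rough illustration of how such a frequency setting could drive the choice between partial and full runs, here is a minimal sketch. The helper names `parse_full_gc_freq` and `is_full_run` are hypothetical, and the choice of which runs are full (run N, 2N, 3N, ...) is an assumption; the PoC may count differently.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical helper: parse the text of FULL_GC_FREQ.
 * Returns 0 for unset, empty, or malformed values (generational mode off). */
unsigned long parse_full_gc_freq(const char *value)
{
    if (value == NULL || !isdigit((unsigned char)value[0])) {
        return 0;
    }
    char *end = NULL;
    unsigned long freq = strtoul(value, &end, 10);
    return (*end == '\0') ? freq : 0;
}

/* Hypothetical helper: is the run with 1-based index `run` a full run?
 * Assumption: with frequency N, runs N, 2N, 3N, ... are full. */
bool is_full_run(unsigned long freq, unsigned long run)
{
    assert(run >= 1);
    if (freq == 0) {
        /* Assumption: with the generational mode off, every run scans everything. */
        return true;
    }
    return run % freq == 0;
}
```

A caller would typically read the setting once, for example with `parse_full_gc_freq(getenv("FULL_GC_FREQ"))`, and keep a run counter next to the other GC globals.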

I've used the following script to test this PoC: https://gist.github.com/arnaud-lb/6bfb493f361e056979571941f1b85990. This simulates a typical Doctrine or Symfony application with a very large DI container or UnitOfWork; a worst-case scenario for the GC.

With FULL_GC_FREQ=10, the benchmark runs 10-20% faster when the root object (the Tree class) is large, or when it creates garbage often (which keeps the threshold low). The benchmark runs slower when the root object is relatively small and garbage is created infrequently, because the non-gen GC keeps the threshold high in that case. This could be mitigated by improving how the threshold is adjusted in the gen GC. In the first case, the non-gen GC is slow because of the low threshold, so better threshold adjustment might achieve similar results in the non-gen GC, at the cost of higher peak memory usage than the gen GC.

Of course this is only one benchmark. This needs to be tested on real applications.

This should improve throughput and average pause time, but not maximum pause time.

Related: #17131

Implementation details:

  • Introduced a new type_info flag: GC_OLD. This flag is not part of GC_INFO_MASK, so it's ignored by GC_INFO_MASK() / GC_REMOVE_FROM_BUFFER() / GC_REF_SET_INFO() and persists across GC runs. Caveat: I had to reclaim a bit from GC_ADDRESS.
  • Added a separate buffer for old roots.
  • gc_mark_roots() either ignores OLD nodes (in partial runs), or removes the OLD flag (in full runs).
  • gc_scan_roots() always ignores OLD nodes.
  • gc_collect_roots() moves OLD roots to the old roots buffer (unless this is a full run, as in this case we have proven these roots are not part of a garbage cycle).
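To illustrate why a flag kept outside the info mask survives buffer bookkeeping, here is a small sketch. The bit layout and all `TOY_*` / `toy_*` names are made up for the example and are not the real `zend_types.h` values or macros; the only point is that operations which rewrite or clear the masked bits leave the OLD bit untouched, while the info field loses one bit of range.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative layout only (assumed, not PHP's real one):
 *   bits 0-9   : ordinary type flags
 *   bit  10    : OLD flag, deliberately outside the info mask
 *   bits 11-31 : GC info (buffer address and color), one bit narrower
 *                than it would be without the OLD flag */
#define TOY_OLD        0x00000400u
#define TOY_INFO_SHIFT 11u
#define TOY_INFO_MASK  0xFFFFF800u

uint32_t toy_get_info(uint32_t type_info)
{
    return (type_info & TOY_INFO_MASK) >> TOY_INFO_SHIFT;
}

/* Overwrites the info bits; everything outside the mask is preserved. */
uint32_t toy_set_info(uint32_t type_info, uint32_t info)
{
    assert(info <= (TOY_INFO_MASK >> TOY_INFO_SHIFT));
    return (type_info & ~TOY_INFO_MASK) | (info << TOY_INFO_SHIFT);
}

/* Clears the info bits, as when a node leaves the root buffer. */
uint32_t toy_remove_from_buffer(uint32_t type_info)
{
    return type_info & ~TOY_INFO_MASK;
}

uint32_t toy_mark_old(uint32_t type_info)
{
    return type_info | TOY_OLD;
}

bool toy_is_old(uint32_t type_info)
{
    return (type_info & TOY_OLD) != 0;
}
```

With this layout, calling `toy_remove_from_buffer()` on a node marked with `toy_mark_old()` resets its info to zero but `toy_is_old()` still returns true, so the generation survives until a full run clears it explicitly.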

@dktapps (Contributor) commented Sep 5, 2025

Haven't given much thought to this, but have you considered having it only do a full GC run when e.g. the collector didn't manage to collect some minimum number of cycles? Though I suppose this would still penalise applications which create very few cycles...

In my particular case it'd be preferable to avoid full GC entirely. Realistically I'd still end up doing manual GC triggers anyway (so my application would have control over when to do a full vs incremental GC) but it's probably worth thinking about what conditions should trigger a full automatic GC.
